Sjoerd Beetsma, Maarten de Jeu Class V2A - Group 5
As mentioned in research iteration 1, we have divided our 3 research questions across 3 datasets. The first 2 research questions correspond to the chemical datasets, and the last question corresponds to the review dataset.
Throughout this notebook, we'll attempt to stick to the iterative CRISP-DM workflow as much as possible, and divide our notebook into chapters corresponding to the phases of this philosophy. This chapter division is only a rough indication of what can be found where, though, because the highly iterative workflow requires us to look back and forward in the process quite frequently.
Specifically, everything you're looking for can be found in this order:
The context for our research questions is quite vague from the side of our client, but we'll be examining the following questions:
The dataset for the first research question was acquired from https://archive.ics.uci.edu/ml/datasets/wine+quality. Alongside a dataset about red wine, there is also a dataset about white wine; the first will be used for our first research question, and the two datasets combined will be used for our second research question.
The variables in the chemical datasets are:
1 - fixed acidity: Fixed acids found in wines are tartaric, malic, citric, and succinic.
2 - volatile acidity: Steam-distillable acids like acetic, lactic, formic, butyric, and propionic acids.
3 - citric acid: One of the fixed acids found in a wine.
4 - residual sugar: The natural grape sugars left over in a wine after the fermentation is finished.
5 - chlorides: A major contributor to the saltiness of a wine.
6 - free sulfur dioxide: The part of the sulfur dioxide that remains free (unbound).
7 - total sulfur dioxide: Free sulfur dioxide plus bound sulfur dioxide.
8 - density: Density of a wine, usually increased by fermentable sugars.
9 - pH: Specifies the acidity/basicity of a wine; most wines range from a pH of 3 to 4.
10 - sulphates: Slow down the growth of yeasts and prevent oxidation, keeping the wine fresh for longer.
11 - alcohol: Alcohol percentage of a wine.
12 - quality: A quality score between 0 and 10, based on sensory data.
The source of the data also states that they don't know if all variables are relevant in deciding the quality score of a wine.
Source:
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal @2009
We import some libraries and the dataset to examine the data through code.
import numpy as np
import pandas as pd
import sklearn as sk
# Many specific imports now, so code is more readable for non-programmers later.
from sklearn.feature_selection import RFE
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import neighbors, metrics, naive_bayes, cluster
import plotly.express as px
import py_lib
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
For research question one we will only need the red-wine dataset. The second research question requires a combined dataset with labels added for the color of the wine (white or red). The two datasets will be explored and cleaned at the same time but separately, performing similar operations on both. Merging the two datasets before the data cleaning would result in records being designated outliers that shouldn't be, correlations not being detected that should be, etc. We'll merge the two datasets after cleaning them for use in the second research question.
Let's load in both red and white wine datasets
dataset_red = pd.read_csv("datasets/winequality-red.csv", sep=";")
dataset_white = pd.read_csv("datasets/winequality-white.csv", sep=";")
Let's take the head of one of the chemical property datasets to have a first look at the records in the data.
dataset_red.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
As described by the source, each row seems to correspond to an individual wine, with eleven feature columns describing the chemical properties of the wine and one target column, quality.
To make the columns easier to access later, we change the white spaces in the column names to underscores.
dataset_red.columns = dataset_red.columns.str.replace(' ','_')
dataset_white.columns = dataset_white.columns.str.replace(' ','_')
To see how many data entries we have, we'll check the number of rows in both the red and white datasets.
red_rows, red_columns = dataset_red.shape
white_rows, white_columns = dataset_white.shape
print(f'The red-wine quality dataset contains {red_rows} rows and {red_columns} columns')
print(f'The white-wine quality dataset contains {white_rows} rows and {white_columns} columns')
The red-wine quality dataset contains 1599 rows and 12 columns
The white-wine quality dataset contains 4898 rows and 12 columns
Let's verify that the white spaces in the column names have been changed to underscores.
dataset_red.head(0)
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
For research question 1, all the columns describing chemical properties will be considered feature variables, and the column quality represents the target variable, the variable we want to predict. Let's save them in variables to access them easily from the final dataframe.
The feature variables for research question 2 are (at the start) the same as those for question 1, but will change after some investigation. Its target variable is currently not available as a column (the eventual target variable will be which dataset a record currently belongs to).
feature_vars_q1 = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']
target_var_q1 = 'quality'
To choose an appropriate model for our research questions and available data, it's necessary to have an understanding of the scales of measurement of all the relevant variables.
nomi, disc, ordi, cont = 'Nominal', 'Discrete', 'Ordinal','Continuous'
print('Scales of measurement chemical properties datasets')
pd.DataFrame(index=dataset_red.columns, data=[cont, cont, cont, cont, cont, cont, cont, cont, cont, cont, cont, disc],
columns=['Scale of measurement'])
Scales of measurement chemical properties datasets
| | Scale of measurement |
|---|---|
| fixed_acidity | Continuous |
| volatile_acidity | Continuous |
| citric_acid | Continuous |
| residual_sugar | Continuous |
| chlorides | Continuous |
| free_sulfur_dioxide | Continuous |
| total_sulfur_dioxide | Continuous |
| density | Continuous |
| pH | Continuous |
| sulphates | Continuous |
| alcohol | Continuous |
| quality | Discrete |
As can be seen, all the chemical properties have a continuous scale of measurement, and the quality column has a discrete scale of measurement.
Using describe we can see some important central tendencies and dispersion measures of the dataset.
dataset_red.describe().round(2)
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 | 1599.00 |
| mean | 8.32 | 0.53 | 0.27 | 2.54 | 0.09 | 15.87 | 46.47 | 1.00 | 3.31 | 0.66 | 10.42 | 5.64 |
| std | 1.74 | 0.18 | 0.19 | 1.41 | 0.05 | 10.46 | 32.90 | 0.00 | 0.15 | 0.17 | 1.07 | 0.81 |
| min | 4.60 | 0.12 | 0.00 | 0.90 | 0.01 | 1.00 | 6.00 | 0.99 | 2.74 | 0.33 | 8.40 | 3.00 |
| 25% | 7.10 | 0.39 | 0.09 | 1.90 | 0.07 | 7.00 | 22.00 | 1.00 | 3.21 | 0.55 | 9.50 | 5.00 |
| 50% | 7.90 | 0.52 | 0.26 | 2.20 | 0.08 | 14.00 | 38.00 | 1.00 | 3.31 | 0.62 | 10.20 | 6.00 |
| 75% | 9.20 | 0.64 | 0.42 | 2.60 | 0.09 | 21.00 | 62.00 | 1.00 | 3.40 | 0.73 | 11.10 | 6.00 |
| max | 15.90 | 1.58 | 1.00 | 15.50 | 0.61 | 72.00 | 289.00 | 1.00 | 4.01 | 2.00 | 14.90 | 8.00 |
From the describe we can tell that quite a few columns have a big difference between their maximum and minimum values combined with a low standard deviation, which indicates outliers in, for example, residual_sugar, chlorides, free_sulfur_dioxide and total_sulfur_dioxide.
dataset_white.describe().round(2)
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 | 4898.00 |
| mean | 6.85 | 0.28 | 0.33 | 6.39 | 0.05 | 35.31 | 138.36 | 0.99 | 3.19 | 0.49 | 10.51 | 5.88 |
| std | 0.84 | 0.10 | 0.12 | 5.07 | 0.02 | 17.01 | 42.50 | 0.00 | 0.15 | 0.11 | 1.23 | 0.89 |
| min | 3.80 | 0.08 | 0.00 | 0.60 | 0.01 | 2.00 | 9.00 | 0.99 | 2.72 | 0.22 | 8.00 | 3.00 |
| 25% | 6.30 | 0.21 | 0.27 | 1.70 | 0.04 | 23.00 | 108.00 | 0.99 | 3.09 | 0.41 | 9.50 | 5.00 |
| 50% | 6.80 | 0.26 | 0.32 | 5.20 | 0.04 | 34.00 | 134.00 | 0.99 | 3.18 | 0.47 | 10.40 | 6.00 |
| 75% | 7.30 | 0.32 | 0.39 | 9.90 | 0.05 | 46.00 | 167.00 | 1.00 | 3.28 | 0.55 | 11.40 | 6.00 |
| max | 14.20 | 1.10 | 1.66 | 65.80 | 0.35 | 289.00 | 440.00 | 1.04 | 3.82 | 1.08 | 14.20 | 9.00 |
Just like the red-wine dataset, the white-wine dataset shows similar large gaps between maximum and minimum values.
Let's take a more visual look at the distribution of the data through a histogram for each of the feature attributes, starting off with the red-wine dataset.
dataset_red.hist(figsize=(15,15))
plt.show()
In the red-wine dataset, pH and possibly density have a Gaussian distribution. The other variables seem to have a skewed distribution.
Moving on to the white-wine dataset:
dataset_white.hist(figsize=(15,15))
plt.show()
In the white-wine dataset only pH seems to have a Gaussian distribution (maybe density as well, after removing outliers). The other variables have a skewed, possibly lognormal, distribution.
sns.countplot(data=dataset_red, x='quality').set_title('red-wine quality counts')
plt.show()
sns.countplot(data=dataset_white, x='quality').set_title('white-wine quality counts')
plt.show()
From the count plots above we can tell the quality ratings for red wines range from 3 to 8, with 5 being the most common rating. For white wines they range from 3 to 9, with 6 being the most common rating.
To get a visual understanding of the outliers in the feature columns, each feature gets boxplotted against the Q1 target variable, quality. This gives a small summary of the minimum, first quartile, median, third quartile and maximum of each attribute, plotted against the quality score, showing outliers at all quality levels.
def boxplotter(dataset, y_axes, x_axis):
    """Function to boxplot one x_axis against a list of y_axes of a given dataset."""
    for col in y_axes:
        sns.boxplot(x=dataset[x_axis], y=dataset[col])
        plt.show()
First, boxplot all the feature variables on the y-axis against the Q1 target variable on the x-axis for the red-wine dataset.
boxplotter(dataset=dataset_red, y_axes=feature_vars_q1, x_axis=target_var_q1)
Now do the same for the white-wine dataset
boxplotter(dataset=dataset_white, y_axes=feature_vars_q1, x_axis=target_var_q1)
As can be seen from the boxplots all of our current variables contain outliers.
All outliers in the above boxplots seem to be plausible values rather than incorrect data, unlike some attributes in iteration 1. From the boxplot with alcohol on the y-axis and quality on the x-axis we can see a trend of rising median alcohol percentage as the quality of the wine increases.
For later models it's important to know which variables have a (linear) correlation with each other. To find linear correlations and their direction/strength we make use of Pearson's correlation coefficient.
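As a reminder, Pearson's correlation coefficient between two variables $x$ and $y$ is defined as:

```latex
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little to no linear relationship.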
def corr_matrix_plotter(dataset, title=''):
    """Plot a correlation matrix for all columns in a pandas dataframe
    using seaborn and matplotlib.
    Args:
        dataset: Dataset to construct the correlation matrix for.
        title: Title of the plot.
    Returns:
        None; the heatmap is shown directly."""
    corr = dataset.corr()
    plt.figure(figsize=(10, 7.5))
    cmap = sns.diverging_palette(200, 0, as_cmap=True)  # color palette as cmap
    mask = np.logical_not(np.tril(np.ones_like(corr)))  # mask hiding the upper triangle
    sns.heatmap(corr, annot=True, mask=mask, cmap=cmap, vmin=-1, vmax=1).set_title(title)  # correlation heatmap
    plt.show()
corr_matrix_plotter(dataset_red, 'Red wine correlation matrix.')
corr_matrix_plotter(dataset_white, 'White wine correlation matrix.')
In the correlation matrices graphed above you can see which attributes correlate with which other attributes. Starting with our Q1 target variable 'quality': it has a few correlations, the strongest one being with alcohol for both red and white wines, and a few weaker ones like volatile acidity, sulphates and citric acid for red wines, and density and chlorides for white wines. Because quality is our Q1 target variable, it's the dependent attribute in these correlations.
Quality is dropped for research question 2, so these correlations are not relevant there.
Besides that, there are some correlations among the chemical properties: fixed acidity has a strong correlation with pH, but it's still an independent variable; pH, however, is a dependent variable, as it depends on the former. Volatile acidity, residual sugar, sulphates, chlorides, and density are all independent variables. Total sulfur dioxide depends on free sulfur dioxide, but free sulfur dioxide is independent.
Strongly correlated features might need to be removed during the data preparation phase.
Let's start off the data preparation by checking the datatypes and cleaning or changing them if necessary.
Red-wine quality datatypes:
dataset_red.dtypes
fixed_acidity           float64
volatile_acidity        float64
citric_acid             float64
residual_sugar          float64
chlorides               float64
free_sulfur_dioxide     float64
total_sulfur_dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object
White-wine quality datatypes:
dataset_white.dtypes
fixed_acidity           float64
volatile_acidity        float64
citric_acid             float64
residual_sugar          float64
chlorides               float64
free_sulfur_dioxide     float64
total_sulfur_dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object
These all seem to be in order, so we can move on to checking for, and removing or replacing, any NA values in the datasets.
pd.isna(dataset_red).sum().sum() # checking for total NA values in red_wine
0
pd.isna(dataset_white).sum().sum() # checking for total NA values in white_wine
0
Luckily, all data seems to be complete, so no need to worry about that either.
A remove-outliers function is created and used to remove extreme outliers. This will make a regression algorithm perform better, because the RMSE is sensitive to outliers.
We remove all extreme outliers, leaving the mild ones in the dataset, using an outer fence: 3 IQR below Q1 and 3 IQR above Q3.
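Written out, a value $x$ of a column counts as an extreme outlier when it lies outside the outer fence:

```latex
x < Q_1 - 3\,\mathrm{IQR} \quad\text{or}\quad x > Q_3 + 3\,\mathrm{IQR}, \qquad \mathrm{IQR} = Q_3 - Q_1
```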
def remove_outliers(dataset, fence=3):
    """
    Function to remove any rows containing one or more outliers in a dataframe.
    Default fence of 3 * IQR, detecting extreme outliers.
    Args:
        dataset: Pandas DataFrame or Series
        fence: Number deciding the fence range used to detect outliers.
    Returns:
        The passed dataset minus the rows containing outliers.
    """
    q1 = dataset.quantile(.25)
    q3 = dataset.quantile(.75)
    iqr = q3 - q1
    # turn extreme outliers into NaN values and drop those rows
    return dataset[(dataset >= q1 - (fence * iqr)) & (dataset <= q3 + (fence * iqr))].dropna()
The red-wine dataset contained 1599 rows and the white-wine dataset 4898 before removing the outliers. Let's remove the outliers and check how many rows are left.
dataset_red = remove_outliers(dataset_red)
dataset_white = remove_outliers(dataset_white)
dataset_red.shape, dataset_white.shape
((1435, 12), (4690, 12))
1435 of the 1599 rows are left in the red-wine data; about 10% of the rows contained extreme outliers.
In the white-wine data 4690 of the 4898 rows are left; about 4% of the rows contained extreme outliers.
Duplicate values would probably cause overfitting in the worst case, and imbalanced models in the best case. Let's check if either dataset contains any.
dataset_white.duplicated().sum()
915
dataset_red.duplicated().sum()
218
Both contain duplicates that need to be removed, so let's get rid of them.
dataset_white.drop_duplicates(inplace=True)
dataset_red.drop_duplicates(inplace=True)
The red and white wine chemical property datasets are cleaned but still need normalizing. Because we don't need normalized data for our first research question, only for our second one, we will only normalize the combined red-and-white dataset.
Now both the red and white datasets are ready to be combined into one dataset, with an extra column indicating whether a wine is red or not.
dataset_red['is_red'] = 1
dataset_white['is_red'] = 0
dataset_red_white = pd.concat([dataset_red, dataset_white])
dataset_red_white.sample(10,random_state=2)
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | is_red |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1518 | 7.4 | 0.47 | 0.46 | 2.20 | 0.114 | 7.0 | 20.0 | 0.99647 | 3.32 | 0.63 | 10.5 | 5 | 1 |
| 1449 | 6.9 | 0.19 | 0.49 | 6.60 | 0.036 | 49.0 | 172.0 | 0.99320 | 3.20 | 0.27 | 11.5 | 6 | 0 |
| 767 | 7.5 | 0.60 | 0.32 | 2.70 | 0.103 | 13.0 | 98.0 | 0.99938 | 3.45 | 0.62 | 9.5 | 5 | 1 |
| 506 | 10.4 | 0.24 | 0.46 | 1.80 | 0.075 | 6.0 | 21.0 | 0.99760 | 3.25 | 1.02 | 10.8 | 7 | 1 |
| 365 | 10.0 | 0.42 | 0.50 | 3.40 | 0.107 | 7.0 | 21.0 | 0.99790 | 3.26 | 0.93 | 11.8 | 6 | 1 |
| 50 | 8.8 | 0.66 | 0.26 | 1.70 | 0.074 | 4.0 | 23.0 | 0.99710 | 3.15 | 0.74 | 9.2 | 5 | 1 |
| 1193 | 7.3 | 0.22 | 0.40 | 14.75 | 0.042 | 44.5 | 129.5 | 0.99980 | 3.36 | 0.41 | 9.1 | 7 | 0 |
| 992 | 6.5 | 0.40 | 0.10 | 2.00 | 0.076 | 30.0 | 47.0 | 0.99554 | 3.36 | 0.48 | 9.4 | 6 | 1 |
| 198 | 7.1 | 0.32 | 0.24 | 13.10 | 0.050 | 52.0 | 204.0 | 0.99800 | 3.10 | 0.49 | 8.8 | 5 | 0 |
| 116 | 8.3 | 0.54 | 0.28 | 1.90 | 0.077 | 11.0 | 40.0 | 0.99780 | 3.39 | 0.61 | 10.0 | 6 | 1 |
Machine learning algorithms using a Euclidean distance function benefit from working with normalized data.
We'll define a function that normalizes a pandas Series or DataFrame object. We won't use it for now, because it's best to normalize only when needed, but it's good to have.
def normalize(s):
    """Normalize (standardize) a pandas Series or DataFrame by subtracting the mean
    and dividing by the standard deviation.
    Args:
        s: Series or DataFrame to normalize.
    Returns:
        Normalized object of the same type."""
    return (s - np.mean(s)) / np.std(s)
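A quick sanity check, on a small synthetic Series rather than the wine data, that standardizing this way yields zero mean and unit (population) standard deviation:

```python
import numpy as np
import pandas as pd

def normalize(s):
    """Standardize by subtracting the mean and dividing by the standard deviation."""
    return (s - np.mean(s)) / np.std(s)

example = pd.Series([8.4, 9.5, 10.2, 11.1, 14.9])  # made-up values
normalized = normalize(example)
print(normalized.mean())       # approximately 0
print(normalized.std(ddof=0))  # approximately 1 (ddof=0 matches np.std)
```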
We now have the dataset_red, dataset_white, and dataset_red_white DataFrames, which are all cleaned. Nothing is normalized as of yet, but we have defined a procedure to do this with.
The datasets will be split into a train and test dataset for the models to learn and test their performance.
X, y = dataset_red[feature_vars_q1], dataset_red[target_var_q1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = pd.DataFrame()
We start out by creating a baseline model that always predicts the mean of the target variable in the training set, so that we can attempt to improve this score with other models. Let's check the RMSE of this model.
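py_lib is our own helper module and isn't shown in this notebook; a minimal sketch of what such a mean-predicting baseline regressor could look like (hypothetical — the real DumbRegressor may differ in detail):

```python
import numpy as np
import pandas as pd

class DumbRegressor:
    """Baseline regressor that always predicts the mean of the training target."""
    def __init__(self, y_train):
        self.mean_ = float(np.mean(y_train))

    def predict(self, X):
        # One constant prediction per row of X.
        return np.full(len(X), self.mean_)

# Tiny illustration with made-up quality scores.
baseline = DumbRegressor(pd.Series([5, 6, 5, 7, 5]))
print(baseline.predict(pd.DataFrame({'alcohol': [9.4, 11.0]})))  # two identical mean predictions
```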
baseline = py_lib.DumbRegressor(y_train)
base_line_predictions = baseline.predict(X_test)
plt.plot(base_line_predictions, c='r')
plt.scatter(x=np.arange(len(X_test)), y=y_test, s=4)
plt.xlabel("Wine index")
plt.ylabel("Quality score")
plt.legend(["Baseline prediction", "Actual value"], loc="upper left")
plt.show()
baseline_rmse = sk.metrics.mean_squared_error(y_test, base_line_predictions, squared=False)
scores = scores.append({'algorithm':'Baseline', 'RMSE': baseline_rmse},ignore_index=True)
The baseline model scores an RMSE of 0.81 quality points; this is the score we want to improve upon with further models.
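For reference, the RMSE used throughout this notebook as the error measure is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

where $y_i$ is the actual quality and $\hat{y}_i$ the predicted quality.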
For our first machine learning model we will implement the simplest form of multiple-feature regression: multiple linear regression. This model doesn't have any hyper-parameters to tune, so instead of hyper-parameter tuning we use a feature-selection algorithm.
def feature_elimination(X_train, y_train, n_features, model):
    """
    Function using RFE (recursive feature elimination) to pick the n best features from the train set.
    Args:
        X_train: X training set
        y_train: y training set
        model: model to run RFE for.
        n_features: number of features to be left with after elimination
    Returns:
        Boolean mask of the selected features.
    """
    selector = RFE(model, n_features_to_select=n_features)
    selector.fit(X_train, y_train)
    return selector.support_
# multiple linear regression
lin_regr = LinearRegression()
# keep track of the number of features, the resulting from RMSEs from the model and the column masks
n_features = []
RMSEs = []
masks = []
# for each possible number of available features do recursive feature selection
for i in range(1, len(feature_vars_q1) + 1):
    leftover_features = feature_elimination(X_train, y_train, i, lin_regr)
    X_f, y_f = dataset_red[feature_vars_q1].loc[:, leftover_features], dataset_red[target_var_q1]
    X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, random_state=0)
    lin_regr.fit(X_train_f, y_train_f)
    lin_regr_model_score = sk.metrics.mean_squared_error(lin_regr.predict(X_test_f), y_test_f, squared=False)
    # update the bookkeeping lists
    n_features.append(i)
    RMSEs.append(lin_regr_model_score)
    masks.append(leftover_features)
plt.plot(n_features, RMSEs, 'bx-')
plt.title('Number of features used in model vs RMSE')
plt.xlabel("Number of features used")
plt.ylabel("RMSE")
plt.show()
From the feature selection we can see big improvements going from one feature up to the 6 best features. From 6 up to our total of 11 features we only see a very small improvement. To keep the complexity of the model down and emphasize the most important features, we will use the 6 best ones according to the recursive feature elimination.
selected_features = pd.Series(X.loc[:, masks[6-1]].columns) # 6 nr of features - 1 for correct index
Let's check the 6 selected features.
print(selected_features)
0    volatile_acidity
1           chlorides
2             density
3                  pH
4           sulphates
5             alcohol
dtype: object
Feature variables: volatile_acidity, chlorides, density, pH, sulphates and alcohol. These 6 variables will be used in our linear model.
Fit the model to the X train and y train data
lin_regr = lin_regr.fit(X_train[selected_features], y_train)
Let's see how multiple linear regression compares to the baseline model: get the RMSE of the predictions versus the actual values to measure the performance of the linear model.
lin_regr_model_score = sk.metrics.mean_squared_error(lin_regr.predict(X_test[selected_features]), y_test, squared=False)
scores = scores.append({'algorithm':'multiple linear regression', 'RMSE': lin_regr_model_score},ignore_index=True)
print(f"Linear model scores a RMSE of {round(lin_regr_model_score,3)}")
Linear model scores a RMSE of 0.636
The multiple linear regression scored an RMSE of 0.636. Compared to our baseline this is about a 22% lower RMSE, meaning the baseline has already been improved upon and surpassed by the first model. But it's not quite as low as we want it yet, so let's try to improve further with polynomial regression. Through polynomial regression we can see if a higher polynomial degree results in better performance. A polynomial model of degree 1 is the same as linear regression.
For our hyper-parameter, the polynomial degree, we will try degrees 1 to 5.
poly_r_degrees = np.arange(1,6,1)
Make the model
poly_r_make_train_model = np.vectorize(lambda d: make_pipeline(PolynomialFeatures(d),LinearRegression()).fit(X_train[selected_features], y_train))
Train a model for every earlier-specified degree.
poly_r_models = poly_r_make_train_model(poly_r_degrees)
Let's see how well polynomial regression performs with different polynomial degrees. For polynomial regression we will also use the 6 RFE-selected features.
poly_r_models_scores = np.vectorize(lambda m: sk.metrics.mean_squared_error(m.predict(X_test[selected_features]), y_test, squared=False))(poly_r_models)
scores = scores.append({'algorithm':'Polynomial regression', 'RMSE': np.min(poly_r_models_scores)},ignore_index=True)
plt.plot(poly_r_degrees, poly_r_models_scores, 'bx-')
plt.title('Polynomial degree VS RMSE score')
plt.xlabel('Polynomial degree')
plt.ylabel('RMSE')
plt.show()
for degree, rmse in zip(poly_r_degrees, poly_r_models_scores):
    print(f"degree {degree}, score {round(rmse, 3)}")
degree 1, score 0.636
degree 2, score 0.633
degree 3, score 0.656
degree 4, score 0.968
degree 5, score 2.135
Polynomial regression performed best with a polynomial degree of 2, but didn't show any notable improvement from degree 1 to 3.
We have implemented multiple linear and polynomial regression in combination with recursive feature elimination. Our models performed better than the baseline, with about a 22% improvement in the measured RMSE.
Let's try 2 other regression algorithms, Lasso and Ridge regression.
Ridge regression will reduce the impact of features that are not important when predicting the target value. Lasso regression can go further and eliminate features entirely, ignoring the ones that don't have a significant impact.
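In formulas, both methods add a penalty term to the least-squares objective; in scikit-learn's formulation (with weight vector $w$ and regularization strength $\alpha$):

```latex
\text{Ridge:}\quad \min_{w}\; \lVert Xw - y\rVert_2^2 + \alpha\lVert w\rVert_2^2
\qquad\qquad
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\lVert Xw - y\rVert_2^2 + \alpha\lVert w\rVert_1
```

The $\ell_2$ penalty shrinks coefficients towards zero, while the $\ell_1$ penalty can set them exactly to zero.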
For Ridge regression we'll try alphas from 0 to 1.8 in steps of 0.2.
ridge_r_alphas = np.arange(0,2,.2)
ridge_r_make_train_model = np.vectorize(lambda a: Ridge().set_params(alpha=a).fit(X_train, y_train))
Make and train the models with different alphas.
ridge_r_models = ridge_r_make_train_model(ridge_r_alphas)
Test the models' performance by predicting on the test dataset and computing the RMSE from the predictions and the actual values.
ridge_r_models_scores = np.vectorize(lambda m: sk.metrics.mean_squared_error(m.predict(X_test), y_test, squared=False))(ridge_r_models)
scores = scores.append({'algorithm':'Ridge regression', 'RMSE': np.min(ridge_r_models_scores)},ignore_index=True)
plt.plot(ridge_r_alphas, ridge_r_models_scores, 'bx-')
plt.title('Ridge regression alphas VS RMSE score')
plt.xlabel('Alpha')
plt.ylabel('RMSE')
plt.show()
for alpha, rmse in zip(ridge_r_alphas, ridge_r_models_scores):
    print(f"alpha {round(alpha, 2)}, RMSE {round(rmse, 3)}")
alpha 0.0, RMSE 0.631
alpha 0.2, RMSE 0.63
alpha 0.4, RMSE 0.629
alpha 0.6, RMSE 0.629
alpha 0.8, RMSE 0.63
alpha 1.0, RMSE 0.63
alpha 1.2, RMSE 0.63
alpha 1.4, RMSE 0.63
alpha 1.6, RMSE 0.63
alpha 1.8, RMSE 0.63
Ridge regression performs best at an alpha of 0.4, measuring an RMSE of 0.63.
Lastly, let's try Lasso regression.
The alphas we will try are 0.0001, 0.5, and 10.
lasso_r_alphas = [.0001, .5, 10]
Make the model
lasso_r_make_train_model = np.vectorize(lambda a: Lasso().set_params(alpha=a).fit(X_train, y_train))
Train the model with the different alphas.
lasso_r_models = lasso_r_make_train_model(lasso_r_alphas)
Get the RMSE to decide the performance of the Lasso models.
lasso_r_models_scores = np.vectorize(lambda m: sk.metrics.mean_squared_error(m.predict(X_test), y_test, squared=False))(lasso_r_models)
scores = scores.append({'algorithm':'Lasso regression', 'RMSE': np.min(lasso_r_models_scores)},ignore_index=True)
plt.plot(lasso_r_alphas, lasso_r_models_scores, 'bx-')
plt.title('Lasso regression alphas VS RMSE score')
plt.xlabel('Alpha')
plt.ylabel('RMSE')
plt.show()
for alpha, rmse in zip(lasso_r_alphas, lasso_r_models_scores):
    print(f"alpha {round(alpha, 2)}, RMSE {round(rmse, 3)}")
alpha 0.0, RMSE 0.631
alpha 0.5, RMSE 0.788
alpha 10, RMSE 0.805
Lasso regression has the best RMSE score when using a very low alpha, in this case 0.0001. Lasso regression didn't improve upon Ridge regression.
Let's investigate the Ridge model more to see which coefficients had the most impact in the prediction.
Take out the best-performing Ridge model.
q1_model = ridge_r_models[np.argmin(ridge_r_models_scores)]
Make a dataframe with the features and their corresponding coefficients, and sort them.
df_coeff = pd.DataFrame(zip(X_train.columns, q1_model.coef_))
df_coeff.columns = ['feature', 'coefficient']
df_coeff = df_coeff.sort_values('coefficient')
And plot them in a barplot to get a visual representation of the impact of each feature variable.
plt.figure(figsize=(10,10))
plt.bar(x=df_coeff['feature'], height=df_coeff['coefficient'], alpha=1)
plt.title('Ridge regression feature variables impact')
plt.xlabel('Feature variable')
plt.ylabel('Coefficient')
plt.xticks(rotation = 30)
plt.show()
From the barplot it can be seen that chlorides and volatile_acidity have the most negative impact, and sulphates have by far the most positive impact.
As a reminder: in the feature selection for multiple linear and polynomial regression we arrived at the following 6 feature variables:
volatile_acidity, chlorides, density, pH, sulphates and alcohol.
While in the Ridge regression model the 6 most impactful feature variables are:
volatile_acidity, chlorides, citric_acid, pH, sulphates and alcohol.
Since the most impactful pair of feature variables from the Ridge regression, chlorides and volatile acidity, overlaps with the features obtained from RFE, we'll visualize the Ridge regression's predictions.
Let's predict on the test data with the Ridge regression model.
y_predict = pd.DataFrame(q1_model.predict(X_test), columns=['quality'])
X_y_predictions = X_test.reset_index().join(y_predict)
By making a regression plot of the actual quality of the test set against the predicted quality, we can see how well our model did: the tighter the spread of the dots around the line, the more accurately the model predicts.
sns.regplot(x=y_test, y=y_predict, ci=None)
plt.title('Regressionplot on predicted and actual red wine quality')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
plt.show()
To show the linear correlation in the model, the most impactful pair of feature variables will be plotted against the target variable. In our case chlorides and volatile acidity together have a strong negative impact on the target variable quality. Let's plot them in a 3D scatter plot.
fig = px.scatter_3d(X_y_predictions, x='chlorides', y='volatile_acidity', z='quality', color='quality')
fig.update_traces(marker=dict(size=5,opacity=0.6))
fig.update_layout(scene = dict(
zaxis = dict( title='Quality')))
For our first research question:
With what accuracy can we decide the quality of a red-wine according to its chemical properties?
We ended up implementing 4 models: Linear regression, Polynomial regression, Ridge regression and Lasso regression.
To be reminded of each model's performance, let's see the RMSE of each:
scores
| | RMSE | algorithm |
|---|---|---|
| 0 | 0.805480 | Baseline |
| 1 | 0.636232 | multiple linear regression |
| 2 | 0.632558 | Polynomial regression |
| 3 | 0.629454 | Ridge regression |
| 4 | 0.630738 | Lasso regression |
Compared to the baseline's RMSE of 0.81, each model scores around 20% lower, with Ridge regression achieving the lowest RMSE by a very small margin compared to the other implemented models.
In conclusion: we can predict the quality of a red wine with an RMSE of 0.63, which is a significant improvement over the baseline but not quite the performance needed to predict the quality with a small margin of error. We found that 2 chemical properties, chlorides and volatile_acidity, have the biggest negative impact on the quality of a red wine. Sulphates and alcohol, on the other hand, seem to have a positive impact on the quality.
For convenience, we'll add an overview of the 'combined' dataset. Then we'll split the dataset into a test and train portion. As a reminder: we'll be using the chemical properties as feature variables to try to determine whether a wine is red or white. The target variable is 'is_red', a nominal (boolean) variable with 2 possible values. We'll also normalize the feature variables, because we're using a model with euclidean distance measurements after all.
dataset_red_white
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | is_red | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 5 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13.0 | 40.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 | 0 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 | 0 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | 0 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 | 0 |
| 4897 | 6.0 | 0.21 | 0.38 | 0.8 | 0.020 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 | 0 |
4992 rows × 13 columns
X, y = normalize(dataset_red_white[feature_vars_q1]), dataset_red_white['is_red']
X_train_red_white, X_test_red_white, y_train_red_white, y_test_red_white = train_test_split(X,y, random_state=0)
We'll start out by creating a simple model that always guesses the mode measurement of the target variable in the train dataset, and seeing what accuracy that gets, so that we have a score to attempt to improve upon.
baseline = py_lib.DumbNominalClassifier(y_train_red_white)
baseline_predictions = baseline.predict(X_test_red_white)
baseline_predictions[:2]
array([0, 0], dtype=int64)
The majority of the test dataset consists of white wine, so that's what we'll blindly be guessing with the baseline model.
metrics.accuracy_score(y_test_red_white, baseline_predictions)
0.7700320512820513
The baseline model is already able to get an accuracy of about 77% on the test dataset.
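Our `py_lib.DumbNominalClassifier` is a custom helper, but scikit-learn ships an equivalent majority-class baseline in `sklearn.dummy.DummyClassifier`. A minimal sketch on toy data (the arrays below are hypothetical stand-ins for our split):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Hypothetical stand-ins for the real train/test split.
X_train = np.zeros((8, 1))
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # majority class: 0 (white)
X_test = np.zeros((4, 1))
y_test = np.array([0, 0, 1, 0])

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)  # always predicts the mode of y_train
print(metrics.accuracy_score(y_test, predictions))  # 0.75
```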
We'll try to improve on this, starting with a simple model and moving towards more complex ones. We'll start out using KNearestNeighbors because of its simplicity, but because of the over-representation of white wine, we suspect other models might be able to do better.
Normally, one would build in business-knowledge to pick a few variables that might be relevant to start out with. In our case, we have no business expert, so we'll start with many attributes, and eliminate any unnecessary ones along the way.
Hyper-parameter wise, we'll start by experimenting with different values for 'k'. Our classes are much bigger, but we'll call 100 the maximum sensible value for k for now, starting at 1, with only odd values being fair game because with 2 target classes an odd k avoids ties.
def knn_models(k_vals: np.ndarray,
X_train: pd.DataFrame,
y_train: pd.Series,
X_test: pd.DataFrame,
y_test: pd.Series) -> pd.DataFrame:
"""Create, fit and score KNN-models with different k-values for one dataset parallel to each other
through numpy vectorized operations.
Args:
k_vals: values for k to train a model with.
X_train: feature variables training set.
y_train: target variables training set.
X_test: feature variables test set.
y_test: target variables test set.
Returns:
Pandas dataframe that contains:
index column with every k-value.
'Model' column with the associated model.
'Score' column with the accuracy score that model got against the test-set."""
models = np.vectorize(lambda k: neighbors.KNeighborsClassifier(n_neighbors = k))(k_vals) # Initialize model objects.
models = np.vectorize(lambda m: m.fit(X_train, y_train))(models) # Train models.
scores = np.vectorize(lambda m: metrics.accuracy_score(y_test, m.predict(X_test)))(models) # Make predictions for test set and score.
return pd.DataFrame({"Model": models, # Create and return Dataframe.
"Score": scores},
index=k_vals)
ks = np.arange(1, 101, 2)
models = knn_models(ks, X_train_red_white, y_train_red_white, X_test_red_white, y_test_red_white)
plt.plot(models.index, models["Score"], 'bx-')
plt.title("Accuracy per K- value.")
plt.show()
It appears 0 < k < 20 is the most promising area, so let's have a look at the exact values.
models.loc[:20, "Score"]
1     0.993590
3     0.995994
5     0.995192
7     0.995994
9     0.995994
11    0.995192
13    0.995192
15    0.994391
17    0.993590
19    0.993590
Name: Score, dtype: float64
We're getting above 99% accuracy, with the peak around 99.60%. Out of all the choices with peak accuracy, it's good to pick a higher setting for k to avoid overfitting. k = 9 has this top score for instance, and would probably be a good choice for production.
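The heuristic "pick the largest k among the peak scores" can be expressed directly on the scores table; a sketch with the accuracy values above hard-coded:

```python
import pandas as pd

# Accuracy per k, copied from the scores above.
scores = pd.Series({1: 0.993590, 3: 0.995994, 5: 0.995192,
                    7: 0.995994, 9: 0.995994, 11: 0.995192})

# Among all k's that reach the peak accuracy, prefer the largest k
# to reduce the risk of overfitting.
best_k = scores[scores == scores.max()].index.max()
print(best_k)  # 9
```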
99.6% is an incredible score, which would make the model terribly useful. So incredible, in fact, that we should consider the possibility that something has gone wrong. We split our data into a training and a test set:
print(f"""Data shapes:
Train Test
X: {X_train_red_white.shape} {X_test_red_white.shape}
y: {y_train_red_white.shape} {y_test_red_white.shape}""")
Data shapes:
     Train       Test
X:   (3744, 11)  (1248, 11)
y:   (3744,)     (1248,)
We removed duplicates that could end up in both the train and the test dataset from the dataset in data preparation.
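A quick sanity check on leakage is to inner-join the train and test features on all columns; zero matching rows means no record ended up in both splits. A sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical feature rows standing in for X_train / X_test.
X_train = pd.DataFrame({"chlorides": [0.076, 0.098], "pH": [3.51, 3.20]})
X_test = pd.DataFrame({"chlorides": [0.092, 0.075], "pH": [3.26, 3.16]})

# An inner merge on all shared columns returns only records present in both splits.
overlap = X_train.merge(X_test, how="inner")
print(len(overlap))  # 0 -> no leakage between the splits
```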
Perhaps certain chemical values really do have big, discernible gaps between red and white wine. We know that most attributes have normal or lognormal distributions in the red and white datasets individually, but we'd expect obvious differences between the corresponding distributions of the two datasets.
columns = dataset_white.loc[:, :"alcohol"].columns # We don't need to look at quality or is_red.
fig = plt.figure(figsize=(15, 15))
fig.subplots_adjust(hspace=0.4)
fig.suptitle("Attribute value distributions in Red / White dataset", fontsize=16)
for i, col in enumerate(columns):
white_view, red_view = dataset_white[col].to_numpy(), dataset_red[col].to_numpy() # View of Numpy arrays contained in relevant cols.
# To enforce equal bins size between 2 histograms.
att_min = min(np.min(white_view), np.min(red_view))
att_max = max(np.max(white_view), np.max(red_view))
bins = np.linspace(att_min, att_max, 20)
#Actually making subplot
ax = fig.add_subplot(4, 3, i + 1)
ax.set_title(col)
ax.set_ylabel("Frequency")
ax.set_xlabel("Value")
ax.hist(dataset_white[col], alpha=0.5, bins=bins, color="yellow")
ax.hist(dataset_red[col], alpha=0.5, bins=bins, color="r")
ax.axvline(np.mean(white_view), color="darkorange", linestyle="dashed")
ax.axvline(np.mean(red_view), color="darkred", linestyle="dashed")
ax.legend(["White mean", "Red mean", "White frequency", "Red Frequency"], loc="upper right")
From these plots we can see that there are indeed some quite descriptive variables when it comes to predicting what color a wine is, which could explain the high accuracy.
We can divide the attributes into 3 categories based on these distributions (based on visual inspection, and no formal math or logic of any kind):
We'll synthesize some new Train/Test feature datasets from the old ones, based on the first category, to train some new models with to see if a stripped down model can perform just as well.
cat1 = ["volatile_acidity", "chlorides", "total_sulfur_dioxide"]
X_train_red_white_2, X_test_red_white_2 = X_train_red_white[cat1], X_test_red_white[cat1]
models = knn_models(ks, X_train_red_white_2, y_train_red_white, X_test_red_white_2, y_test_red_white)
plt.plot(models.index, models["Score"], 'bx-')
plt.title("Accuracy per K- value.")
plt.show()
models.loc[45:55, "Score"]
45    0.991186
47    0.991987
49    0.991987
51    0.990385
53    0.990385
55    0.990385
Name: Score, dtype: float64
We can see that a KNN-model based on only 3 attributes (volatile_acidity, chlorides and total_sulfur_dioxide) can still predict what color a wine is with 99.20% accuracy when k = 47 (in the test set), which could also be interesting for an eventual model in production.
Because 3 attributes turn out to be the most important, we can gain some insight by actually looking at the datapoints and their classifications in a 3D scatterplot, with the 3 most important attributes for discerning a wine's color on the axes.
# Can't color properly with plotly with numeric target variable, so create temporary column with str value.
dataset_red_white["is_red_as_str"] = dataset_red_white["is_red"].astype(str)
fig = px.scatter_3d(data_frame=dataset_red_white,
x=cat1[0],
y=cat1[1],
z=cat1[2],
color="is_red_as_str",
color_discrete_map={"1": "rgb(255,0,0)",
"0": "rgb(255,255,102)"},
title="Most important attributes in discerning wine color.")
fig.update_traces(marker={'size': 2, 'opacity': 0.8})
fig.update_layout(legend= {'itemsizing': 'constant'})
# Drop pointless column
dataset_red_white.drop(columns="is_red_as_str", inplace=True)
Because we're dealing with so many Gaussian distributions, with distinctive differences between the target categories for some attributes, this problem could also be well suited for a Gaussian Naive Bayes Classifier.
Even though the model we have right now already performs quite well, it could be interesting to try this out in an attempt to reach 100% accuracy. We'll try a Gaussian Naive Bayes classifier with both all the chemical properties, and the stripped down chemical properties.
nb_model_complete = naive_bayes.GaussianNB()
nb_model_complete.fit(X_train_red_white, y_train_red_white)
nb_model_complete.score(X_test_red_white, y_test_red_white)
0.9919871794871795
nb_model_stripped = naive_bayes.GaussianNB()
nb_model_stripped.fit(X_train_red_white_2, y_train_red_white)
nb_model_stripped.score(X_test_red_white_2, y_test_red_white)
0.9783653846153846
With 99.2% accuracy on all feature variables, and 97.8% on the stripped-down feature list, the Gaussian Naive Bayes classifier performs worse than K Nearest Neighbors in both scenarios, which is quite unexpected. A possible real-world advantage is that this classifier can return the probabilities of a given record belonging to either class, which could be useful.
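That probability output is exposed through `predict_proba`; a toy sketch (the single hypothetical feature stands in for our chemical properties):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy one-feature fit; the real model uses the standardized chemical properties.
X = np.array([[0.1], [0.2], [0.8], [0.9]])
y = np.array([0, 0, 1, 1])
model = GaussianNB().fit(X, y)

probs = model.predict_proba(np.array([[0.15], [0.85]]))
print(probs.shape)        # (2, 2): one probability per class, per record
print(probs.sum(axis=1))  # each row sums to 1
```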
With the use of a fairly simple algorithm like K Nearest Neighbors, one can easily classify the color of a wine based on certain chemical properties with an accuracy above 99%, even when looking only at volatile acidity, chlorides and total sulfur dioxide. Simplicity seems to be key in this example, because a more complex model like a Gaussian Naive Bayes classifier performs ever so slightly worse.
For our third research question we have a dataset of wine reviews from different sommeliers, with information about each wine's origin, price and review.
With our research question being:
Can we distinguish between logical clusters of wineries? (Premium, budget, high-quality, etc...)
From our dataset's source ..., we have a list of the different attributes:
1 - country
2 - description
3 - designation
4 - points
5 - price
6 - province
7 - region_1
8 - region_2
9 - taster_name
10 - taster_twitter_handle
11 - title
12 - variety
13 - winery
The clearest way to explain what's contained in each of these columns is to look at them.
Let's load in the dataset and have a first look.
dataset_reviews = pd.read_csv("datasets/winemag-data-130k-v2.csv")
dataset_reviews.head(5)
| Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
At first look it appears most of the (usable) variables are nominal, with points and price as the only numerical (discrete) values. Some columns initially seem fairly useless for the types of analysis we'll most probably be doing for this project, like 'description', but we'll keep them just in case we end up doing anything like a sentiment-analysis type model. It also appears we have a redundant index column called "Unnamed: 0".
We're not quite sure what feature variables we'll be using for the third question, but we know we'll be grouping by 'winery'. We'll start out by using 'price' and 'points' as further feature variables, but during the modelling stage we might end up using more.
Considering we're looking for logical clusters (unsupervised learning), there are no target variables.
Like mentioned earlier, we're mostly dealing with categorical (nominal, specifically) variables in this dataset. There are 2 numerical values. Points and price are both discrete values.
We can examine the spread of values of the numerical variables through histograms:
_ = dataset_reviews[["price", "points"]].hist() # _ = to prevent pointless table from showing on screen.
Points has an obvious Gaussian distribution. It does appear the price graph is made quite unreadable by some outliers. We'll have a proper look at those later, let's ignore them for now to have a better look at the distribution.
dataset_reviews[dataset_reviews["price"] < 100]["price"].hist(bins=20)
<AxesSubplot:>
It appears that the price column is lognormally distributed.
Because we won't be comparing groups against each other, we'll use a separate way of creating boxplots to explore outliers in the numeric values.
sns.boxplot(data=dataset_reviews[["points"]])
<AxesSubplot:>
It appears the median number of points is around 88, and the distribution is quite symmetric. There are a couple of outliers around 100, but nothing extreme.
sns.boxplot(data=dataset_reviews[["price"]])
<AxesSubplot:>
The boxplot for price is barely a boxplot because of all the outliers. Like we already noticed with the histogram, most measurements fall within the 0-100 range, but there are some extremely high outliers. For data exploration, we'll create a separate column with the outliers removed for price.
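`remove_outliers` is a helper defined earlier in the notebook; its exact implementation isn't shown here, but a typical IQR-based version might look like this sketch (hypothetical, the actual helper may differ):

```python
import pandas as pd

def remove_outliers(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - factor*IQR, Q3 + factor*IQR] with NaN."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return series.where(series.between(lower, upper))  # non-matching values become NaN

prices = pd.Series([10, 12, 15, 14, 11, 13, 500])  # 500 is an obvious outlier
cleaned = remove_outliers(prices)
print(cleaned.isna().sum())  # 1: only the 500 was dropped
```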
dataset_reviews["price_no_outliers"] = remove_outliers(dataset_reviews["price"])
dataset_reviews["points_no_outliers"] = remove_outliers(dataset_reviews["points"])
We'll use the pearson correlation coefficient to see if there's a linear correlation between the only 2 numerical columns in the dataset. Because of the extreme outliers in price, it might be sensible to also check this with the outliers removed.
dataset_reviews[["points", "price"]].corr()
| points | price | |
|---|---|---|
| points | 1.000000 | 0.416167 |
| price | 0.416167 | 1.000000 |
dataset_reviews[["points_no_outliers", "price_no_outliers"]].corr()
| points_no_outliers | price_no_outliers | |
|---|---|---|
| points_no_outliers | 1.000000 | 0.549194 |
| price_no_outliers | 0.549194 | 1.000000 |
It appears there is a moderate linear correlation between price and points when disregarding outliers. Because we're looking for clusters this is not terribly relevant, but still noteworthy.
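As an aside, a rank-based coefficient such as Spearman's (not used above; just a sketch with made-up numbers) is insensitive to the extreme price outliers, so it can confirm the relationship without removing anything:

```python
import pandas as pd

# Hypothetical data with one extreme price outlier.
df = pd.DataFrame({"points": [85, 87, 88, 90, 92, 95],
                   "price": [10, 14, 18, 25, 40, 3300]})

pearson = df["points"].corr(df["price"], method="pearson")
spearman = df["points"].corr(df["price"], method="spearman")
print(round(spearman, 2))  # 1.0: ranks are unaffected by the outlier
```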
Having explored the data, we're ready to clean it up. We have already separated the outliers in a separate column in the dataset, because that couldn't wait until data preparation. First, let's get rid of the unnamed index column, because that's not useful in any way.
dataset_reviews.drop("Unnamed: 0", axis=1, inplace=True)
dataset_reviews.dtypes
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
price_no_outliers        float64
points_no_outliers         int64
dtype: object
And let's change the categorical values from 'object', to 'string', to 'category'.
dataset_reviews = dataset_reviews.convert_dtypes()
categorical_vars = ["country", "description", "designation", "province", "region_1", "region_2", "taster_name", "taster_twitter_handle", "title", "variety", "winery"]
dataset_reviews[categorical_vars] = dataset_reviews[categorical_vars].astype("category")
dataset_reviews.dtypes
country                  category
description              category
designation              category
points                      Int64
price                       Int64
province                 category
region_1                 category
region_2                 category
taster_name              category
taster_twitter_handle    category
title                    category
variety                  category
winery                   category
price_no_outliers           Int64
points_no_outliers          Int64
dtype: object
Our goal is to find logical clusters for different types of wineries, using a clustering algorithm. First, we create a separate dataframe from the original dataframe grouped by winery, with the mean value of the point/price value of that winery's wine as columns.
dataset_winery = dataset_reviews.groupby("winery")[["price_no_outliers", "points_no_outliers"]].mean()
dataset_winery.dropna(inplace=True)
dataset_winery
| price_no_outliers | points_no_outliers | |
|---|---|---|
| winery | ||
| 1+1=3 | 18.333333 | 86.666667 |
| 10 Knots | 24.75 | 83.25 |
| 100 Percent Wine | 18.0 | 86.333333 |
| 1000 Stories | 19.0 | 90.5 |
| 1070 Green | 25.0 | 88.0 |
| ... | ... | ... |
| Órale | 30.0 | 91.0 |
| Öko | 11.0 | 85.0 |
| Ökonomierat Rebholz | 58.25 | 91.5 |
| àMaurice | 38.75 | 90.55 |
| Štoka | 22.0 | 89.333333 |
15729 rows × 2 columns
For many clustering algorithms it's useful to have the data normalized as well. Let's do that in separate columns.
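`normalize` is our own helper from earlier in the notebook; presumably a min-max scaler, along the lines of this sketch (the real helper may differ):

```python
import pandas as pd

def normalize(series: pd.Series) -> pd.Series:
    """Min-max scale a series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())

prices = pd.Series([10.0, 20.0, 30.0])
print(normalize(prices).tolist())  # [0.0, 0.5, 1.0]
```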
dataset_winery["price_normalized"] = normalize(dataset_winery["price_no_outliers"])
dataset_winery["points_normalized"] = normalize(dataset_winery["points_no_outliers"])
Let's have a look at the data:
plt.scatter(dataset_winery["price_normalized"], dataset_winery["points_normalized"], s=2)
<matplotlib.collections.PathCollection at 0x1b7e2b852e0>
So far, it mostly just looks like a thick cloud. Hopefully, a clustering algorithm will be able to see through the fog and give us more insight.
We'll start out by using kMeans on the (normalised) points and price because of its simplicity, and move on to more complex models and/or add more data in case it doesn't give any useful results. Even though it's doubtful anything with k > 20 will be useful, we'll still try everything from k = 2 to k = 30 to see how the model behaves.
ks = np.arange(2, 31) # Values for k to try out.
models = np.vectorize(lambda k: cluster.KMeans(n_clusters=k, random_state=0))(ks) # Create model objects.
models = np.vectorize(lambda m: m.fit(dataset_winery[["price_normalized", "points_normalized"]]))(models) # Fit models.
c1_models_scores = np.vectorize(lambda m: m.score(dataset_winery[["price_normalized", "points_normalized"]]))(models) # Score models
plt.plot(ks, c1_models_scores, 'bx-')
[<matplotlib.lines.Line2D at 0x1b7e2b06340>]
Considering the fact there's no clear 'elbow' in the elbow plot, and no clear clusters in the previously constructed graph, it doesn't look very promising. We can look at a visualization at k=6 of the result to confirm:
final_model = models[4]  # k = 6; take it out of the array for convenience.
predictions = final_model.predict(dataset_winery[["price_normalized", "points_normalized"]])
plt.scatter(dataset_winery["price_no_outliers"], dataset_winery["points_no_outliers"], c=predictions, s=2)
<matplotlib.collections.PathCollection at 0x1b7e2bf7910>
Like we expected, not very useful. It's doubtful another algorithm would be able to make sense of that cloud, so the way to go is probably to use the curse of dimensionality to our advantage. There is one problem: there isn't much numerical data available for our models. All the categorical variables in the dataset are nominal, so we'd need a get_dummies() construction to turn them into useful data. Unfortunately, all the nominal variables have too many unique values to realistically use, so we'll have to get creative.
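To see why `get_dummies` is infeasible here, it helps to count the levels per nominal column; a toy sketch (the real counts run into the thousands, e.g. ~16k wineries):

```python
import pandas as pd

# Toy stand-in for the review dataframe.
df = pd.DataFrame({"winery": ["A", "B", "C", "A"],
                   "variety": ["Riesling", "Merlot", "Merlot", "Syrah"]})

# nunique tells us how many dummy columns get_dummies would create per column.
print(df[["winery", "variety"]].nunique().to_dict())  # {'winery': 3, 'variety': 3}

dummies = pd.get_dummies(df[["variety"]])
print(dummies.shape)  # (4, 3): one new column per level
```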
An interesting metric could be the number of wine reviews listed for a single winery. We'll create a new grouped-by dataframe with this column added, normalize its columns, and visualize it.
dataset_reviews["winery"].value_counts()
Wines & Winemakers 222
Testarossa 218
DFJ Vinhos 215
Williams Selyem 211
Louis Latour 199
...
Clos Dady 1
Clos Fardet 1
Clos Junet 1
Clos Larcis 1
Hayward by Folin Cellars 1
Name: winery, Length: 16757, dtype: int64
dataset_winery_2 = dataset_reviews.groupby("winery").agg({"price_no_outliers": np.mean,
                                                          "points_no_outliers": np.mean,
                                                          "title": lambda np_a: np_a.size})  # 'title' is arbitrary; any column works for counting.
dataset_winery_2.rename(columns={"title": "review_amount"}, inplace=True)  # Rename to a clearer name.
dataset_winery_2.head(5)
| price_no_outliers | points_no_outliers | review_amount | |
|---|---|---|---|
| winery | |||
| 1+1=3 | 18.333333 | 86.666667 | 6 |
| 10 Knots | 24.75 | 83.25 | 4 |
| 100 Percent Wine | 18.0 | 86.333333 | 3 |
| 1000 Stories | 19.0 | 90.5 | 2 |
| 1070 Green | 25.0 | 88.0 | 1 |
dataset_winery_2["price_normalized"] = normalize(dataset_winery_2["price_no_outliers"])
dataset_winery_2["points_normalized"] = normalize(dataset_winery_2["points_no_outliers"])
dataset_winery_2["reviews_normalized"] = normalize(dataset_winery_2["review_amount"])
dataset_winery_2.dropna(inplace=True)
fig = px.scatter_3d(data_frame=dataset_winery_2,
x='price_normalized',
y='points_normalized',
z='reviews_normalized')
fig.update_traces(marker={'size': 1, 'colorscale': 'Viridis', 'opacity': 0.8})
Visually, we can see there are still no sensible clusters to be found. There is no more numerical data, and creating even more from what we have would be fairly pointless. Unfortunately, we have to conclude that there's no useful insight to be gained from clustering on our current data, as there just aren't many numbers to work with.